Revisiting Norm Estimation in Data Streams
Authors
Abstract
We revisit the problem of (1±ε)-approximating the Lp norm, 0 ≤ p ≤ 2, of an n-dimensional vector updated in a stream of length m with positive and negative updates to its coordinates. We give several new upper and lower bounds, some of which are optimal.
LOWER BOUNDS. We show that for the interesting range of parameters, Ω(ε^{-2} log(nm)) bits of space are necessary for estimating Lp in one pass for any real constant p ≥ 0. If p is strictly positive, the lower bound improves to Ω(ε^{-2} log(nmM)), where updates are in the set {−M, ..., M}. Our results imply space-optimality of the celebrated L2 estimation algorithm of [Alon, Matias, Szegedy JCSS 1999] and the L1-difference algorithm of [Feigenbaum et al., SICOMP 2002], as well as a separation in the complexity of L0-estimation between the insertion-only and turnstile models, since [Bar-Yossef et al., RANDOM 2002] give an Õ(ε^{-2} + log n) space algorithm for the former. Our techniques also improve the best known lower bound for additive entropy approximation, and variants were used in [Clarkson, Woodruff, 2008] to obtain tight space bounds for streaming linear algebra problems.
ALGORITHMS. We give the first one-pass streaming algorithm for estimating L0, the number of distinct elements in the turnstile model, in optimal space up to a constant factor. Our algorithm uses O(ε^{-2} log(nmM)) space and has O(log(mM)) update time. For estimating Lp, 0 < p < 2, we improve the algorithm of [Indyk, J. ACM '06] by reducing the space from O(ε^{-2} log(nmM) log(n)) to O(ε^{-2} log(nmM) log(n)/log(1/ε)). In light of our lower bounds, this is optimal for any ε = n^{-Ω(1)}. We accomplish this by showing that a pseudorandom generator (PRG) construction of [Armoni, RANDOM 1998] can be improved by using a more space-efficient implementation of a recent extractor of [Guruswami, Umans, Vadhan, CCC 2007]. Specifically, the improved Armoni PRG stretches a seed of O((S/(log(S) − log log(R) + O(1))) log R) bits to R bits fooling space-S algorithms for any R = 2^{O(S)}, improving the O(S log R) seed length of the PRG of [Nisan, Combinatorica 1992], and improving several existing streaming algorithms.
Author affiliations: MIT Computer Science and Artificial Intelligence Laboratory; supported by a National Defense Science and Engineering Graduate (NDSEG) Fellowship, with much of this work done while the author was at the IBM Almaden Research Center. IBM Almaden Research Center, 650 Harry Road, San Jose, CA, USA.
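For readers unfamiliar with the turnstile setting, the following is a minimal Python sketch of a tug-of-war style L2 estimator in the spirit of the Alon, Matias, Szegedy algorithm that the abstract cites. It is an illustration under stated assumptions, not the construction from this paper: the class name `TugOfWarSketch`, the parameters `groups` and `per_group`, the prime modulus, and the use of the low bit of a random cubic polynomial as a sign hash are all choices made here for the example. Roughly, `per_group` plays the role of 1/ε^2 and `groups` controls the failure probability.

```python
import random
import statistics

PRIME = (1 << 61) - 1  # field size for the hash polynomials (illustrative choice)


class TugOfWarSketch:
    """Linear sketch estimating the L2 norm of a vector under turnstile updates."""

    def __init__(self, groups=9, per_group=64, seed=0):
        rng = random.Random(seed)
        self.groups = groups
        self.per_group = per_group
        k = groups * per_group
        # A random degree-3 polynomial per counter yields 4-wise independent
        # hash values; the sign is taken from the low bit (hypothetical helper).
        self.coeffs = [[rng.randrange(PRIME) for _ in range(4)] for _ in range(k)]
        self.counters = [0] * k

    def _sign(self, j, i):
        a, b, c, d = self.coeffs[j]
        h = (((a * i + b) * i + c) * i + d) % PRIME
        return 1 if h & 1 else -1

    def update(self, i, delta):
        """Apply x[i] += delta; delta may be negative (turnstile model)."""
        for j in range(len(self.counters)):
            self.counters[j] += self._sign(j, i) * delta

    def estimate_l2(self):
        """Median of group means of squared counters, then a square root."""
        means = []
        for g in range(self.groups):
            chunk = self.counters[g * self.per_group:(g + 1) * self.per_group]
            means.append(sum(z * z for z in chunk) / self.per_group)
        return max(statistics.median(means), 0.0) ** 0.5
```

Because every counter is a linear function of the vector, a deletion exactly cancels the matching insertion, which is what lets the same sketch work with both positive and negative updates. Each `update` call here touches every counter, so this sketch makes no attempt to match the update times discussed in the abstract.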
Similar resources
Comparing Data Streams Using Hamming Norms (How to Zero In)
Massive data streams are now fundamental to many data processing applications. For example, Internet routers produce large scale diagnostic data streams. Such streams are rarely stored in traditional databases, and instead must be processed “on the fly” as they are produced. Similarly, sensor networks produce multiple data streams of observations from their sensors. There is growing focus on ma...
Data Streams with Bounded Deletions
Two prevalent models in the data stream literature are the insertion-only and turnstile models. Unfortunately, many important streaming problems require a Θ(log(n)) multiplicative factor more space for turnstile streams than for insertion-only streams. This complexity gap often arises because the underlying frequency vector f is very close to 0, after accounting for all insertions and deletions...
Large-scale Inversion of Magnetic Data Using Golub-Kahan Bidiagonalization with Truncated Generalized Cross Validation for Regularization Parameter Estimation
In this paper a fast method for large-scale sparse inversion of magnetic data is considered. The L1-norm stabilizer is used to generate models with sharp and distinct interfaces. To deal with the non-linearity introduced by the L1-norm, a model-space iteratively reweighted least squares algorithm is used. The original model matrix is factorized using the Golub-Kahan bidiagonalization that proje...
CS-621 Theory Gems
So far, we have seen streaming algorithms for two important variants of the Lp-norm estimation problem: L0-norm estimation (the distinct elements problem) and L2-norm estimation. We also noted that the L1-norm estimation problem (at least, when we do not allow element deletions) corresponds to just computing the length of the stream and thus can be trivially solved in O(log n) space. Therefore, the ...
Revisiting the Direct Sum Theorem and Space Lower Bounds in Random Order Streams
Estimating frequency moments and Lp distances are well-studied problems in the adversarial data stream model, and tight space bounds are known for these two problems. There has been growing interest in revisiting these problems in the framework of random-order streams. The best space lower bound known for computing the k-th frequency moment in random-order streams is Ω(n^{1−2.5/k}) by Andoni et al., a...
Revisiting Frequency Moment Estimation in Random Order Streams
We revisit one of the classic problems in the data stream literature, namely, that of estimating the frequency moments Fp for 0 < p < 2 of an underlying n-dimensional vector presented as a sequence of additive updates in a stream. It is well-known that using p-stable distributions one can approximate any of these moments up to a multiplicative (1 + ε)-factor using O(ε^{-2} log n) bits of space, and...
Journal: CoRR
Volume: abs/0811.3648
Publication date: 2008